Orchestrate the Multiplicity of LLMs

Introduction

For a while now, I’ve been captivated by Large Language Models (LLMs) and their countless variants: GPT-4, Claude, Ollama, Phind, DeepSeek… The web is full of innovative projects, but I quickly started to feel a certain burden: constantly switching crates or libraries for each LLM API, and fumbling to make these different approaches coexist in a single codebase. That’s when I challenged myself to design a single, easy-to-use Rust tool that can drive multiple LLM backends. And that’s how RLLM was born.

In this article, I’ll explain how I got there, what RLLM has to offer, and above all why I wanted to approach things in a very “Stripe-like” way: namely, by providing a fluid and intuitive API, even if what’s happening behind the scenes remains complex.

Note: I would like to personally thank @philpax, who made it possible for me to continue this work through the Rust crate llm. The library is now developed within the llm crate, while rllm now serves simply as a wrapper around it.


My Motivations: “A Single Controller for All LLMs”

I often found myself juggling multiple crates:

  • openai-rs for GPT,
  • anthropic-rs for Claude,
  • some homegrown tinkering for Ollama or Phind,
  • plus parallel scripts for evaluation.

It worked, but it felt like a puzzle. I spent my time configuring each library, juggling multiple APIs, and writing adaptation code to compare responses from different LLMs.

RLLM aims to unify all these features under one roof, while leaving the user free to choose which backend to enable. The goal isn’t to enforce any specific usage, but to provide a common thread.


The Basics: A “Stripe-like” Rust API

The Builder

I’ve always loved Stripe-style APIs, where you configure a client by chaining .foo(...) methods. So I modeled RLLM on this concept:

// Import paths assumed from the crate layout; check the rllm docs if they differ.
use rllm::builder::{LLMBackend, LLMBuilder};

let llm = LLMBuilder::new()
    .backend(LLMBackend::OpenAI) // pick the backend: OpenAI, Anthropic, Ollama, ...
    .api_key("sk-...")           // credentials for that backend
    .temperature(0.7)            // sampling temperature
    .model("gpt-3.5-turbo")      // model identifier the backend understands
    .build()?;                   // returns a Box<dyn LLMProvider>

The result is a single object (Box<dyn LLMProvider>) that’s ready to handle chat(...) or complete(...) calls depending on the selected backend—no more needing to remember five different crates.
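
Once built, the same object handles a chat call regardless of backend. Here is a minimal sketch: the ChatMessage shape, the ChatRole enum, and the return type of chat(...) are assumptions made for illustration, so check the crate docs for the exact signatures.

use rllm::chat::{ChatMessage, ChatRole};

// One-message conversation sent to whichever backend was configured above.
let messages = vec![ChatMessage {
    role: ChatRole::User,
    content: "Summarize Rust's ownership model in one sentence.".into(),
}];

match llm.chat(&messages) {
    Ok(answer) => println!("{answer}"),
    Err(e) => eprintln!("chat failed: {e}"),
}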

Optional Modules

I then added various optional modules or features:

  • chain to chain multiple prompts,
  • embedding to retrieve vectors from an LLM,
  • evaluator to compare, score, or rank responses.

Each module can be enabled à la carte, without cluttering things for users who only want the basics.


Multi-Scoring and Advanced Evaluation

A key feature is a more nuanced evaluation of LLM responses, going beyond simple binary scoring. Inspired by community feedback on Hugging Face, this system will combine multiple axes of evaluation: JSON validity, semantic relevance, moderation of tone, freedom from toxicity, and so on.

The planned multi-scoring feature will have a fluid API:

// Coming soon
let evaluator = LLMEvaluator::new(vec![openai, anthropic])
    // Criterion 1: reward responses that are valid JSON.
    .scoring(|resp| if is_valid_json(resp) { 2.0 } else { 0.0 })
    // Criterion 2: reward responses that contain "yes".
    .add_criterion(|resp| if resp.contains("yes") { 1.0 } else { 0.0 })
    // Ask each LLM to explain its own answer (Roadmap 2025).
    .self_explain(true);

A particularly innovative feature will be self_explain(true) (Roadmap 2025), which lets the LLM evaluate its own answers and provide more transparency into its reasoning process.


Multi-turn Conversations (Roadmap 2025)

The next version of RLLM will bring advanced conversation management, especially useful for chatbots and assistants. Among the planned capabilities:

  1. Accumulate a dialogue in a Vec<ChatMessage>,
  2. Analyze it via evaluate_dialogue(...),
  3. Get a score array per turn.

This feature is currently in testing and will be available in an upcoming release.
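
As a rough sketch of that flow, under the assumption that ChatMessage keeps a simple role/content shape and that evaluate_dialogue(...) returns one score per turn:

// Roadmap 2025: hypothetical multi-turn evaluation flow.
let dialogue = vec![
    ChatMessage { role: ChatRole::User, content: "How do I parse JSON in Rust?".into() },
    ChatMessage { role: ChatRole::Assistant, content: "serde_json is the usual choice...".into() },
    ChatMessage { role: ChatRole::User, content: "Can you show a short example?".into() },
];

// One score per turn makes it easy to spot where a long conversation degrades.
let per_turn_scores = evaluator.evaluate_dialogue(&dialogue)?;
for (turn, score) in per_turn_scores.iter().enumerate() {
    println!("turn {turn}: {score:.2}");
}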


Parameter Exploration (Roadmap 2025)

To simplify prompt optimization, we’re developing a system for automatic parameter exploration. Here’s a preview of the planned syntax:

// Planned feature (Roadmap 2025)
let param_sets = vec![
    ParamSet { temperature: 0.2, top_p: 0.9, description: "calm" },
    ParamSet { temperature: 0.8, top_p: 0.95, description: "creative" },
];

// Run the same messages through each parameter set and collect the evaluations.
let exploration = evaluator.evaluate_param_exploration(&param_sets, &messages)?;
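
If the feature ships in that shape, inspecting the outcome could be as simple as the loop below; the fields on the exploration results are hypothetical and only show the intent.

// Hypothetical: report the score obtained with each parameter set.
for result in &exploration {
    println!("{}: score {:.2}", result.description, result.score);
}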

Other Improvements on the Way

  • Best-of: Automatic selection of the best response among multiple LLMs
  • Benchmarking Hooks: CSV export and logging for performance tracking, for example:

// Planned features
evaluator.export_csv("results.csv", &results)?;
evaluator.log_results("Prompt1-run", &results);

These features are under development and will be integrated progressively into future versions of RLLM.


Best-of: Pick the Best

Many developers run the same chat against multiple LLMs and then want to automatically pick the highest-scoring response. That’s why I added .best_of(...). It’s essentially a shortcut: we run evaluate_chat(...), then take max_by(score). This saves you from repeating the “find the top” logic everywhere.
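
Conceptually, it boils down to the few lines below. This is a minimal sketch: evaluate_chat and the fields on the scored responses (score, text) are assumptions used only to illustrate the “evaluate, then take the max” idea.

// Sketch of the best-of shortcut: score every provider's answer, keep the highest.
let scored = evaluator.evaluate_chat(&messages)?;
let best = scored
    .iter()
    .max_by(|a, b| a.score.partial_cmp(&b.score).unwrap_or(std::cmp::Ordering::Equal))
    .expect("at least one LLM returned a response");
println!("best response (score {:.2}): {}", best.score, best.text);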


Conclusion: RLLM, a Single Point of Entry

I hope this introduction piques your interest in RLLM. I have three main objectives:

  1. Unify access to multiple backends (OpenAI, Anthropic, Ollama, etc.) in Rust.
  2. Simplify multi-criteria and multi-turn evaluations without requiring a dozen separate scripts.
  3. Remain a “small modular building block,” flexible rather than a monolithic framework.

I’m convinced we can push even further—linking a multi-step agent, delving deeper into chain-of-thought bridging, or incorporating external “tools.” But for now, RLLM covers the essentials, sparing you from having to learn five different APIs and offering helpful scoring and export mechanisms.

If you’re curious, don’t hesitate to experiment, dig into the source code, or share your feedback about how you use it!
Enjoy exploring—I can’t wait to see what innovations arise with LLMs and how RLLM might help you get there.